1. Introduction
1.1 Data loading
1.2 Data scaling
2. Broad analysis
2.1 Boxplots
2.2 Densityplot
2.3 K-means clustering
2.4 PCA
3. Specific analysis: Lapatinib
3.1 Analysis of biomarker and t-test between treated and untreated cells
3.1.1 Expression of biomakers before and after the treatment of Lapatinib
3.1.2 Searching for our own biomarkers
3.1.3 Boxplot of our biomarkers
3.1.4 ggBoxplot of our biomarkers’ basalexpression in breast cells with Kruskal-Wallis test
3.1.5 Heatmap of biomarkers in breast cells
3.2. Paired t-test between the Lapatinib treated und untreated data
3.2.1 Paired t-test of our biomarker before and after treatment
3.2.2 Visualization of the genome-wide paired t-test in a vulcano plot
3.3 K-means
3.4 Spearman correlation
4. Main questions
4.1 Question 1: Does the cell doubling time correlate with reduced drug sensitivity?
4.1.1 Linear regression ~ Inoculation_Density
4.1.2 Linear regression ~ Doubling-time
4.1.3 Multiple regression
4.1.4 Table of conclusion
4.2 Question 2: Does Lapatinib have a similar effect on lung cancer as Erlotinib?
4.2.1 Heatmap
4.2.2 Plot Erlotinib and Lapatinib, colored by tissue
4.2.3 Pearson correlation
4.2.4 Lung cellines
4.2.5 Anova
4.3 Question 3: Does Lapatinib have a similar effect on brain metastases as it does on breast cancer?
4.3.1 Volcano plot
4.3.2 Heatmap for visualization
4.3.3 Investigation of correlation
Our project is about Lapatinib and divided into three parts: Broad analysis, specific analysis and our three main questions. The main questions are the most important part of our project. They address whether cell doubling time correlates with reduced drug sensitivity and the effect of lapatinib on lung cancer and brain metastases. Twenty-five per cent of breast cancers overexpress ErbB2 / HER which leads to a more aggressive phenotype. Lapatinib blocks tyrosine kinases that belong to the HER2 receptor, which is increasingly found on the cell surface of cancer cells. HER2 is a receptor for the epidermal growth factor (EGF), which stimulates the cell division of cancer cells. By blocking these HER2 receptors, the unregulated growth of cancer cells can be controlled again. Lapatinib binds the ATP-binding site of the receptor’s intracellular domain and inhibits the survival and proliferation pathways of the tumor cell.
After checking for normalization, we scaled our data in the first place to provide the scaled data for further analysis.
As the very first step in our data analysis, we checked the normalization of the data.
In order to check whether there is a connection between the various drugs, we colored the plot by drug.
As shown, there is a connection between the different “boxes” and the different drugs. This could be the reason for different batches so we scale the data.
As you can see the scaling is necessary to compensate the batch effect. Scaling is then also applied on Untreated before it is further analyzed.
As an other method to get in touch with the data, we checked if there is a difference between the treated and untreated genexpression. The abline shows the 3 quantiles ( 25 % 50 % 75 % )
As assumed, the expression patterns don’t differentiate much.
To look for clusters in the raw data we performed a k-means clustering and searched for potentially clusters using the most variable (>75 % Var), and thus the most informative genes.
Running a loop for the best cluster-amount (searching for an “elbow”)
As we wanted an “elbow”, it is hard to determine the amount of clusters.
Since the clustering method by performing k-means clustering was not successful, we performed a principal component analysis for dimensionality reduction as another method for identifying clusters or patterns while preserving as much information of the original data set as possible.
pca <- prcomp(t(Fold_Change), scale = TRUE)
By plotting a scree plot of the variation percentage, we can display how much variation a principal component captures from the original data.
As seen in the scree plot, the first three or four PCs have capture most of the information since the curve bends at an “elbow” at PC4 - this is our cutting off point.
To determine clusters of samples based on their similarity, we plotted PC1 and PC2 as well as a PC2 and PC3 while coloring by tissue type and drug respectively.
#plot PC1 and PC2 colored by tissue
plot(pca$x[,1],
pca$x[,2],
col = color_tissue,
pch = 19,
xlab = paste("PC1 (",pca.var.per[1],"%)"),
ylab = paste("PC2 (",pca.var.per[2],"%)"))
As we can see, in contrast to the k-means method, clusters can be determined. Respecting the color annotation, it is clear to say that the samples cluster according to drug treatment.
Unfortunately, we can not use them as biomarkers because expression either decreases only slightly or even increases. So we searched for our own biomarkers.
## GDF15 NUPR1 DDIT3 TRIB3 ATF3 ASNS
## 0.7150148 0.6843715 0.5794749 0.5753359 0.5603209 0.5305063
GDF3 Growth Differentiation Factor 3 belongs to the transforming growth factor beta (TGF-beta) superfamily and has some intrinsic activity.
NUPR1 stands for nuclear protein, transcriptional regulator, 1. It regulates the transcription during stress signals by empowering cells with resistance to stress.
DDIT3 encodes for the DNA damage-inducible transcript 3 protein, a member of the CCAAT/enhancer-binding protein which is a transcription factor. It is important for the response of cell stresses. The DNA damage-inducible transcript 3 protein induces cell cycle arrest and apoptosis in response to ER stress.
TRIB3 stands for Tribbles Pseudokinase 3. The kinase is induced by the transcription factor NF-kappaB. It is essential for cell proliferation.
ATF3 stands for activating transcription factor 3. It belongs to the mammalian activation transcription factor/cAMP responsive element-binding (CREB). ATF-3 is induced by physiological stress.
ASNS stands for Asparagine Synthetase: The enzyme catalyzes the production of the amino acid L-asparagine.
ggboxplot(df_breast, x="Biomarkers", y="Expression", color = "Biomarkers",
add = "jitter", legend = "none")+
rotate_x_text(angle = 45)+
stat_compare_means(method = "kruskal.test", label.y = 15)+ # Add global kurskal wallis test p-value
stat_compare_means(label = "p.signif", method = "t.test",
ref.group = ".all.", hide.ns = TRUE) # Pairwise comparison against all
##
## Paired t-test
##
## data: before and after
## t = -1.5497e-13, df = 718145, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.0003680021 0.0003680021
## sample estimates:
## mean of the differences
## -2.909656e-17
## Marker pvalue
## 1: GDF15 2.612447e-12
## 2: NUPR1 1.327187e-11
## 3: DDIT3 9.979311e-20
## 4: TRIB3 6.995573e-20
## 5: ATF3 2.142510e-12
## 6: ASNS 2.207251e-14
The silhouette-plot can be used to determin the amount of clusters. The main value, the average silhouette width is an indicator of the chosen amount of clusters (higher values indicate better clusters).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0007514 0.0090894 0.0138573 0.0197222 0.0226351 0.4897715
## [1] 3325 54
The negative bars indicate an incorrectly assigned datapoint. Overall, the fewer clusters you have, the better your silhouette width values are. This is due to the datapoints being very proximate to one another and not being able to define specific or different clusters.
To determine clusters of samples based on their similarity, we plotted PC1 and PC2 as well as a PC2 and PC3 while coloring by microsatellite instability status and drug respectively.
Modeling of a linear relationship between cell doubling time and inoculation density.
# Spearman correlation
cor(Cellline_Annotation$Doubling_Time, Cellline_Annotation$Inoculation_Density, method = "spearman")
## [1] 0.5674381
##
## Call:
## lm(formula = Cellline_Annotation$Inoculation_Density ~ Cellline_Annotation$Doubling_Time)
##
## Coefficients:
## (Intercept) Cellline_Annotation$Doubling_Time
## 9110.9 169.4
As you can see from the plot, there is a positive correlation between inoculation density and doubling time. This means that the higher the inoculation density, the higher the doubling time, which suggests that the tumour grows slowly because the cells need more time to divide.
As mentioned in our presentation, we want to create a model to predict GI50-values and thus to predict, if Lapatinib is a good choice. Therefore, we took several datasets and performed three regression analyses.
The first linear model tries to predict the GI-50 value among the data of the Inoculation density.
The second linear model tries to predict the GI-50 value among the data of the Doubling-time.
Finally, we did a multiple regression with both datasets to predict GI50-values.
The following table shows important values.
| Inoculation_Density | Doublingtime | Multiple regression | |
|---|---|---|---|
| R-squared-value | 0.0020077 | 0.0155763 | 0.0469458 |
| F-statistic | 0.0925404 | 0.7278459 | 1.0097948 |
| RMSE of test-dataset | 0.9140916 | 1.0247092 | 0.9413304 |
The data is hard to discuss, because we used random values for our test and training dataset. Hence, different values on every run are obtained. We could use “selected values” (First 20) but this would be kind of cheating. But overall, the models do not seem to be appropriate for good linear regression.
As we read in the paper “Antitumor and antiangiogenic effect of the dual EGFR and HER-2 tyrosine kinase inhibitor lapatinib in a lung cancer model. (Diaz et al., 2010)”, Lapatinib and Erlotinib are said to have similar effects. To verify this, we first looked at the correlation of the drugs and then created a graph showing the difference between GI50 for a certain cell line and the middle GI50 for both drugs.
Erlotinib and Lapatinib have a strong correlation.
The difference from the GI50 for a particular cell line and the mean GI50 is plotted here for Erlotinib and Lapatinib.
## [1] 0.6528188
A Pearson correlation coefficient of ~ 0.65 confirms that these patterns are similar. One reason for this is that both Erlotinib and Lapatinib are EGFR inhibitors. Cell lines that were more sensitive are displayed as bars that project to the right of the mean. Cell lines that were less sensitive are displayed with bars projected to the left.
To answer our more specific question, whether Lapatinib also has an effect on lung cancer, we will only look at the cell lines in lung cancer.
The difference from the GI50 for cell line from lung tissue and the mean GI50 is plotted here for Erlotinib and Lapatinib.
## [1] 0.9609488
A pearson correlation coefficent of ~ 0.96 suggests that Lapatinib has a similar effect on lung cancer as Erlotinib.
##
## Fligner-Killeen test of homogeneity of variances
##
## data: log2FoldChange by Drug
## Fligner-Killeen:med chi-squared = 0.75387, df = 1, p-value =
## 0.3853
As the paper “Lapatinib Distribution in HER2 Overexpressing Experimental Brain Metastases of Breast Cancer” (Taskar et.al., 2012) suggests, Lapatinb crosses the brain-blood barrier. Since breast cancer tends to spread and form metastases in brain cells and Lapatinib is used in anti-breast cancer therapy, we wanted to analyse our data for similar effects after Lapatinib treatment in breast and cns tissue cells.
At first, we selected Lapatinib response genes by creating matrices with cns- and breast-cell line fold change values.
Next, we performed a paired t-test of treated and untreated cns and breast samples, respectively, to determine statistical significant fold change values. Since we are performing multiple testing over different cell lines from breast and cns tissues, we want to adjust our p value by performing a Benjamini Hochberg correction.
For visualization of the biological significance (fold change values) and the statistical significance (adjusted p-values), we plotted our data in an interactive volcano plot.
To visualize the expression patterns of the top peak genes present in both tissue types we calculated a heatmap, comparing the expression profiles of the two tissue types.
The heatmap shows, that the top expression genes are regulated similarly in breast and in cns cells after treatment with Lapatinib.
At last, we investigated the expression regulation in both tissue types by calculating the pearson correlation between the fold change values of the top peak genes.
## [1] 0.921812
The obtained correlation value of ~ 0.92 indicates that Lapatinib treatment yields a similar effect on gene expression in breast and cns tissue cells and therefore might be possible to treat cancer in brain metastases.